Results and comments

The Medical Expenditure Panel Survey (MEPS) is a medical data set. It contains 44 variables describing the patients’ demographics (age, race, gender), socioeconomic status (education, income, insurance), medical information (diagnosed diseases, symptoms, health status from a self-filled survey), and health expenses. The goal is to predict the cost of healthcare from the description of a given patient.

The data set contains 8 numerical and 35 categorical independent variables; none of them is strongly correlated with the dependent variable. The dependent variable, ‘HEALTHEXP’, describes the patient’s health expenses. Its boxplot has a long tail, so the variable was transformed with a base-3 logarithm (which is easier to explain than the natural logarithm).
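
Such a transform could be sketched as follows (a minimal sketch; the +1 shift is an assumption so that a zero expense, which appears later in the report, maps to 0 rather than minus infinity):

```python
import numpy as np

# Base-3 logarithm of health expenses; the +1 shift is an assumption
# so that a zero expense maps to 0 instead of minus infinity.
def log3_transform(expenses):
    return np.log1p(expenses) / np.log(3)

# Inverse transform, back to the original expense scale.
def inverse_log3(y):
    return np.expm1(y * np.log(3))
```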

Two models were trained for this exercise: an XGB model and a linear model.

XGB

The XGB model was trained within a pipeline that one-hot encoded the categorical variables and standard-scaled the numerical variables. Results on the test set:

  • r2: 0.371
  • rmse: 2.17
  • mae: 1.618

Linear

A lasso linear model was trained; lasso provides a means of variable selection. Final model results on the test set (the same test set as for the XGB model):

  • r2: 0.158
  • rmse: 2.511
  • mae: 1.869
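
Lasso’s selection behaviour can be illustrated on synthetic data, where the L1 penalty zeroes out the coefficients of uninformative features (a sketch; the alpha value is a placeholder):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features carry signal.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# The L1 penalty drives coefficients of irrelevant features to exactly zero.
lasso = Lasso(alpha=0.5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
```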

The results were worse than for the XGB model, but the lasso model was more selective with its variables.

3) XGB model prediction and Ceteris Paribus explanation

Patient 3639 is a 33-year-old man. He has never been married and has a bachelor’s degree. He suffers from joint pain and asthma, with no other positive diagnoses. He has a high income and private insurance.

The XGB model predicted the cost to be 3 to the power of 6.18 while the true value was 3 to the power of 6.51.
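
Since predictions live on the base-3 log scale, they can be back-transformed to the original expense scale (a quick check; the rounded exponents come from the report):

```python
# Back-transform the base-3 log values to the original expense scale.
predicted = 3 ** 6.18  # about 888
actual = 3 ** 6.51     # about 1277
```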

The ceteris paribus plot for ‘POVCAT15’ indicates that if he were poorer he would pay less for his healthcare (by a factor of about the square root of 3). His age (‘AGE31X’) puts him in the middle of the predicted price range: if he were younger he would have lower costs, and if he were older he would pay more.

If he did not have asthma he would pay 3 to the power of 0.3 times less. Similarly, without joint pain (‘JTPAIN31’) he would pay less for his health expenses. A lower education level would also reduce his spending (CP plot for the categorical variable ‘EDRECODE’). If the patient had no insurance his costs would be 9 times lower, while with public insurance his cost would stay the same. The model lowered the prediction by a factor of 3 to the power of 1.2 due to the patient’s gender.

4) XGB model 2 observations with different Ceteris Paribus explanations

Patient 975 is a young (22-year-old) man. He has never been married and graduated from high school. He is poor and in good health, and has not been diagnosed with any disease or condition.

The XGB model prediction was close to the true value which is 0.

The Ceteris Paribus profile for his age (variable ‘AGE31X’) barely changes his prediction (by at most 3 to the power of 0.2 times). Income (variable ‘INCOME_M’) would significantly increase the predicted cost if he earned more. Similarly, any change in his poverty status (‘POVCAT15’) would increase the prediction, and a positive diagnosis for any disease would increase his medical expenses. Interestingly, if the patient quit his smoking habit he would spend 3 to the power of 0.2 times more on his health expenses (variable ‘ADSMOK42’). This could mean that the patient would have higher expenses because he would spend money on his health instead of on cigarettes.

On the other hand, there is patient 896. He is 54 years old and married. He is also poor, but has a higher income than patient 975. He has tested positive for high blood pressure, high cholesterol, diabetes, joint pain, arthritis, asthma, walking limitations, and cognitive limitations. He does not smoke and has public insurance.

The XGB model predicted that the patient would spend 3 to the power of 9 on health expenses, while the true value was 3 to the power of 8.82.

The CP profile for his age shows that if he were 20 years old or younger his costs would be 3 times lower. If he were older, his costs would also be lower. The age effect is more significant than for patient 975. The income variable would lower his prediction, which is the opposite behaviour to the CP profile of patient 975. Similarly, the ‘POVCAT15’ variable has a different effect on patient 896: the prediction first decreases and, after passing the value 3, increases. If patient 896 had not been diagnosed with high cholesterol (‘CHOLDX’), his predicted cost would be 3 to the power of 0.48 times lower. A similar effect is observed for the variables describing his positive tests for diabetes (‘DIABDX’) and cognitive limitations (‘COGLIM31’).

As with patient 975, patient 896 would spend less on his healthcare if he smoked.

5) One observation - XGB model Ceteris Paribus profile and linear model Ceteris Paribus profile

Patient 704 is a 59-year-old woman. She has never been married. She has public insurance and a low income. She has tested positive for high blood pressure, high cholesterol, heart disease, joint pain, and arthritis. She has walking, cognitive, and social limitations. She does not smoke.

Her health expenses (rounded to one decimal place):

  • true value: 3 to the power of 8.0
  • xgb model prediction: 3 to the power of 8.2
  • linear model prediction: 3 to the power 7.7

The XGB model prediction was 3 to the power of 0.2 times higher, and the linear model prediction was 3 to the power of 0.3 times lower.

Some different CP profile variables explanations:

  • the ‘AGE31X’ variable would lower the prediction if the patient were older in the XGB model. In the linear model, an older age would increase the prediction.
  • ‘MNHLTH31’ indicates perceived mental health status. Patient 704 has a good mental health status. In the XGB model, the worse the perceived mental health, the higher the cost, and the better the perceived mental health, the lower the cost. In the linear model, the prediction is flat for this variable.
  • in the linear model with lasso, the categorical variables do not affect the dependent variable; their Ceteris Paribus plots do not change the prediction for the patient, while in the XGB model the prediction is affected by categorical variables. For example, for the variable ‘JTPAIN31’, which indicates a positive result for joint pain, the patient would have a cost 3 to the power of 0.2 times lower if she did not have joint pain.

MEPS Data

Variables exploration

REGION AGE31X GENDER RACE3 MARRY31X EDRECODE FTSTU31X ACTDTY31 HONRDC31 RTHLTH31 ... ADSMOK42 PCS42 MCS42 K6SUM42 PHQ242 EMPST31 POVCAT15 INSCOV15 INCOME_M HEALTHEXP
0 2 52 0.0 0.0 5 13 -1 2 2 4 ... 2 25.93 58.47 3 0 4 1 2 11390.0 46612
1 2 55 1.0 0.0 3 14 -1 2 2 4 ... 2 20.42 26.57 17 6 4 3 2 11390.0 9207
2 2 22 1.0 0.0 5 13 3 2 2 1 ... 2 53.12 50.33 7 0 1 2 2 18000.0 808
3 2 2 0.0 0.0 6 -1 -1 3 3 1 ... -1 -1.00 -1.00 -1 -1 -1 2 2 385.0 2721
4 3 25 1.0 0.0 1 14 -1 2 2 1 ... 2 59.89 45.91 9 2 1 3 1 3700.0 1573

5 rows × 44 columns

Numerical variables (8): ['AGE31X', 'PCS42', 'MCS42', 'K6SUM42', 'INCOME_M', 'RTHLTH31', 'MNHLTH31', 'POVCAT15']
Categorical variables (35): ['REGION', 'GENDER', 'RACE3', 'MARRY31X', 'EDRECODE', 'FTSTU31X', 'ACTDTY31', 'HONRDC31', 'HIBPDX', 'CHDDX', 'ANGIDX', 'MIDX', 'OHRTDX', 'STRKDX', 'EMPHDX', 'CHBRON31', 'CHOLDX', 'CANCERDX', 'DIABDX', 'JTPAIN31', 'ARTHDX', 'ARTHTYPE', 'ASTHDX', 'ADHDADDX', 'PREGNT31', 'WLKLIM31', 'ACTLIM31', 'SOCLIM31', 'COGLIM31', 'DFHEAR42', 'DFSEE42', 'ADSMOK42', 'PHQ242', 'EMPST31', 'INSCOV15']

The data has a long tail, hence the logarithmic (base 3) transformation of the explained variable (HEALTHEXP).

Model - XGB and Linear

XGB results:
training rmse: 2.003098115871223
training r2: 0.4735825352408555
training mae: 1.483736015042057
test rmse: 2.1698000363116203
test r2: 0.37123680240833923
test mae: 1.6178791760858526

Lasso Regression results:
training rmse: 2.547681474733906
training r2: 0.14843833455114608
training mae: 1.913403722669455
test rmse: 2.5113545512647533
test r2: 0.15770591573054238
test mae: 1.8692889766558773

Explaining model

Ceteris Paribus explanations for XGB

Preparation of a new explainer is initiated

  -> data              : 14680 rows 43 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 14680 values
  -> model_class       : sklearn.pipeline.Pipeline (default)
  -> label             : MEPS
  -> predict function  : <function yhat_default at 0x7fbd82a9b790> will be used (default)
  -> predicted values  : min = -0.6760427, mean = 5.708615, max = 9.547858
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -7.599170684814453, mean = 0.0016496053506552864, max = 6.528248581332743
  -> model_info        : package sklearn

A new explainer has been created!
Patient no:  3639  , prediction value:  6.1777163  true value:  [6.50884896]


Calculating ceteris paribus!: 100%|██████████| 43/43 [00:00<00:00, 50.93it/s]


Patient 3639 has a rather high prediction since he has joint pain (JTPAIN31), an asthma diagnosis (ASTHDX), low overall ratings of feelings (PHQ242), a high income (POVCAT15), and private insurance coverage (INSCOV15). The lack of other positive diagnoses decreases the prediction.

Patient no:  975  , prediction value:  0.09343961  true value:  [0.]


Calculating ceteris paribus!: 100%|██████████| 43/43 [00:01<00:00, 29.96it/s]

Patient number 975 has a near-zero cost prediction since he has not been diagnosed with any disease. Each positive diagnosis would increase his predicted payment.

Patient no:  896  , prediction value:  9.002718  true value:  [8.82324185]


Calculating ceteris paribus!: 100%|██████████| 43/43 [00:01<00:00, 26.40it/s]

Patient number 896 has a high prediction since he has tested positive for high blood pressure, diabetes, and asthma, and has limitations in physical as well as in work/house/school functioning. He also has a rather low physical component summary.

Comparing CP explanations for XGB with Lasso Regression

Creating explainer for linear model
Preparation of a new explainer is initiated

  -> data              : 14680 rows 43 cols
  -> target variable   : Argument 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 14680 values
  -> model_class       : sklearn.linear_model._coordinate_descent.Lasso (default)
  -> label             : MEPS
  -> predict function  : <function yhat_default at 0x7fbd82a9b790> will be used (default)
  -> predicted values  : min = 3.4357595877232816, mean = 5.710264253185743, max = 9.380884809389386
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -8.09980910468082, mean = -5.145142562077565e-16, max = 6.57191322179619
  -> model_info        : package sklearn

A new explainer has been created!

XGB explanation

Patient no:  704  , prediction value:  8.208433  true value:  [8.04268481]


Calculating ceteris paribus!: 100%|██████████| 43/43 [00:00<00:00, 48.69it/s]

Lasso explanation

Patient no:  704  , prediction value:  7.749701969732133  true value:  [8.04268481]


Calculating ceteris paribus!: 100%|██████████| 43/43 [00:00<00:00, 141.90it/s]